Information in this document is provided in
connection with Intel products. No license, express or
implied, by estoppel or otherwise, to any intellectual
property rights is granted by this document. Except as
provided in Intel's Terms and Conditions of Sale for such
products, Intel assumes no liability whatsoever, and
Intel disclaims any express or implied warranty, relating
to sale and/or use of Intel products including liability
or warranties relating to fitness for a particular
purpose, merchantability, or infringement of any patent,
copyright or other intellectual property right. Intel
products are not intended for use in medical, life
saving, or life sustaining applications. Intel may make
changes to specifications and product descriptions at any
time, without notice. Copyright (c) Intel Corporation 1997. Third-party brands and names are the property of their respective owners.
Because the Pentium Pro family and Pentium® II processors execute speculatively, they must be able to reorder certain memory operations, where allowed, to extract maximum processor performance. The Pentium Pro family and Pentium II processors implement a memory consistency model called speculative processor ordering.
Memory requests to the L2 cache or system memory go through the memory reorder buffer, which functions as a scheduling and dispatch station. This unit keeps track of all memory requests and can reorder some of them (those that are weakly ordered) to avoid blocking and improve throughput. When processor ordering is violated, the Pentium Pro family and Pentium II implementation aborts all instructions under execution, beginning with the illegally completed load, and then resumes. The Pentium Pro family and Pentium II processors have a much deeper write buffer than the earlier Intel486™ and Pentium processors. The write buffer is dynamic and holds between 0 and 12 entries, whereas the Pentium processor has only two write buffers, one for each of its pipelines.
Memory ordering is predominantly a multiprocessor (MP) issue. However, a multiprocessor-style system environment can be constructed from a single CPU and I/O bus masters. Therefore, uniprocessor (UP) memory accesses are the main concern of this document.
This document describes the memory ordering rules. Saying that a memory access A passes another memory access B means that B precedes A in the processor's von Neumann execution stream, but the processor executes A before B has completed, or even begun, because A and B are considered not to conflict. Put simply, the program order is B then A, but the execution order is A then B. The next section details the rules that govern memory access ordering.
mov cx, 0aaah
xor bx, bx
mov word ptr ds:[esi], cx
mov bx, word ptr ds:[esi] ; This will not pass the previous conflicting write
cmp bx, 0aaah ; bx will contain 0aaah instead of 0
Speculative memory types include Write Back (WB), Uncached Speculative Write Combined (USWC), Write Through (WT), and Write Protected (WP). For a detailed description of these memory types, please refer to the collateral titled "Information on Caches and Optimizing Memory Transfers".
This is a result of speculative execution in the P6 implementation. Software and systems should not depend on the details of when, and to what addresses, reads to these memory types occur for correctness. Particular implementations may generate these read prefetches or speculations at arbitrary times.
Due to the length of the pipeline, reads do pass locked instructions, but upon execution of the locked instruction, the entire pipeline is flushed and the instruction prefetch restarts after the locked instruction.
Writes occur in the order of instruction execution. With the WB and UC memory types, the bus trace might not appear in order, but the writes committed to the cache are in order.
xor bx, bx ; clear the CF
mov bx, 5555h
mov cx, 0aaah
mov word ptr ds:[esi], cx
jnc next
mov word ptr ds:[esi], bx ;This is never executed
next: ;word ptr ds:[esi] still contains 0aaah
Writes stored in the write buffer are always written to memory in program order.
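The write-buffer behavior described above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not a description of the actual hardware: the names `toy_cpu`, `toy_store`, `toy_load`, and `toy_drain_one` are hypothetical, the 12-entry capacity comes from the text, and the choice to satisfy a conflicting read from the newest buffered write (rather than by stalling) is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define WB_CAPACITY 12  /* the text gives between 0 and 12 entries */
#define MEM_WORDS   64  /* size of the toy memory; arbitrary */

typedef struct { size_t addr; uint16_t data; } wb_entry;

typedef struct {
    uint16_t mem[MEM_WORDS];   /* "system memory" of the model */
    wb_entry buf[WB_CAPACITY]; /* circular FIFO of pending writes */
    size_t   head;             /* index of the oldest pending write */
    size_t   count;            /* number of pending writes */
} toy_cpu;

void toy_init(toy_cpu *c) { *c = (toy_cpu){0}; }

/* Commit the oldest pending write to memory. Returns 1 if a write
   was committed, 0 if the buffer was empty. */
int toy_drain_one(toy_cpu *c) {
    if (c->count == 0) return 0;
    wb_entry e = c->buf[c->head];
    c->mem[e.addr] = e.data;
    c->head = (c->head + 1) % WB_CAPACITY;
    c->count--;
    return 1;
}

/* Queue a write behind all earlier writes (addr must be < MEM_WORDS).
   If the buffer is full, the oldest write is committed first, so
   memory always sees writes in program order. */
void toy_store(toy_cpu *c, size_t addr, uint16_t data) {
    if (c->count == WB_CAPACITY) toy_drain_one(c);
    size_t tail = (c->head + c->count) % WB_CAPACITY;
    c->buf[tail] = (wb_entry){ addr, data };
    c->count++;
}

/* A read never passes an earlier conflicting write: the newest
   buffered write to the same address wins over memory contents. */
uint16_t toy_load(const toy_cpu *c, size_t addr) {
    for (size_t i = c->count; i > 0; i--) {
        const wb_entry *e = &c->buf[(c->head + i - 1) % WB_CAPACITY];
        if (e->addr == addr) return e->data;
    }
    return c->mem[addr];
}
```

In this model, a `toy_store` followed by a `toy_load` of the same address returns the stored value even before any drain, while repeated `toy_drain_one` calls update `mem` strictly in the order the stores were issued.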
For a description of the serializing instructions, please refer to section 2.4. Serializing Instructions.
For a description of the synchronizing instructions, please refer to section 2.5. Locked Operations.
Self-modifying code is detected and signaled on the next instruction boundary. Signaling means that all pre-fetched instructions are flushed from the pipeline. The processor restarts prefetch beginning at the instruction following the write. Self-modifying code, therefore, exacts a great performance penalty.
mov ax, code_alias_seg
mov fs, ax
mov edi, offset foo
xor ebx, ebx
mov byte ptr fs:[edi], 90h ; Overwrite the instruction at foo, through code_alias_seg,
mov dword ptr fs:[edi+1], 90909090h ; with five NOPs
foo:
mov ebx, 0fca50fh ; This will be overwritten by the last two moves.
cmp ebx, 0h
Register ebx should contain 0h, because the code after "foo:" is not executed until the two moves before "foo:" have completed. Thus, the instruction "mov ebx, 0fca50fh" is overwritten by five NOPs.
Remember, only the execution unit is dynamic and speculative.
Due to the length of the processor's instruction pipeline, instruction fetches do often pass serializing instructions. However, upon execution of the serializing instruction, the entire pipeline is flushed and the instruction prefetch restarts after the serializing instruction.
Serializing instructions constrain speculative execution. Therefore, when a serializing instruction is executed, the dynamic execution feature of the Pentium Pro family and Pentium II processors is defeated. The following instructions are serializing:
Move to special register (including WRMSR);
INVD, INVLPG, WBINVD;
IRET, IRETD, LGDT, LLDT, LIDT, LTR;
CPUID;
RSM;
They wait for all previous instructions to complete, and for all write instructions buffered by the CPU to drain.
Instructions that follow a serializing instruction may already have been prefetched, but they are not actually fetched and executed before the serializing instruction completes; anything fetched earlier must be re-fetched.
Locked operations include XCHG, CMPXCHG, CMPXCHG8B, and read-modify-write instructions to which the LOCK prefix can be applied.
Locked operations wait for all previous instructions to complete. They wait for on-chip buffers to drain, and for external buffers to be emptied.
Locked operations synchronize data, but not instruction fetch or TLB (Translation Lookaside Buffer) miss handling. Data may be present in a cache or TLB, due to the speculative cacheability feature, which means that a lock cannot be used to prevent data from being fetched into a cache or TLB.
This means that no other operation can occur between the Read and Write of a locked RMW operation.
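The atomicity of a locked read-modify-write can be sketched in portable C with the C11 `<stdatomic.h>` interface. The function name `atomic_max` is hypothetical, and the mapping to the instructions named above is an assumption: x86 compilers typically implement a compare-exchange with LOCK CMPXCHG, but that is a compiler detail rather than something this document specifies.

```c
#include <stdatomic.h>

/* Atomically raise *target to at least `value`.
   Returns the value that was observed before any update. */
int atomic_max(atomic_int *target, int value) {
    int seen = atomic_load(target);
    /* The compare-exchange succeeds only if *target still equals
       `seen`, so no other write can slip in between the read and
       the write. On failure, `seen` is refreshed with the current
       contents and the loop re-checks whether an update is needed. */
    while (seen < value &&
           !atomic_compare_exchange_weak(target, &seen, value)) {
        /* retry with the refreshed value */
    }
    return seen;
}
```

The retry loop is needed because the comparison and the update are two separate steps in C; only the compare-exchange itself is a single atomic RMW.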
Both types of instructions wait for all previous instructions to complete, for all writes buffered by the CPU to drain, and for all previous stores to be globally observed.
Subsequent instructions, other than instruction fetch and page table walking, do not begin execution until the IN or OUT instruction has completed.
Instruction fetches (loads) and page table walks will pass I/O operations on the bus. However, the speculative snooping mechanism ensures that they are re-performed if a coherency side effect of the I/O is produced before the I/O cycle is completed.
The memory ordering model for page table walks is more relaxed in the Pentium Pro family and Pentium II processors than in previous processors.
Page table walking to satisfy TLB (Translation Lookaside Buffer) misses can be performed speculatively and out-of-order; page table walks are subject to speculative cacheability.
Instructions are strongly ordered when they are executed in program order, so that accesses are issued in the order implied by the program. Setting the accessed and dirty bits is treated the same as a locked atomic RMW synchronization instruction.
It is recommended that software written to run on the Pentium Pro Family and Pentium II processors assume the processor-order model or a weaker memory-ordering model.
Although processor ordering is supported by the Pentium Pro family and Pentium II processors and by previous CPUs, new code should use locked atomic RMW (Read-Modify-Write) instructions such as XCHG for synchronization instead of relying on memory access ordering. Note that any of these strongly ordered instructions incurs a performance penalty.
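As a minimal sketch of this recommendation, the following C11 code builds a test-and-set lock on an atomic exchange instead of relying on the order of ordinary loads and stores. The type and function names (`xchg_lock`, `xchg_lock_try`, and so on) are hypothetical. It is an assumption, not something stated in this document, that the compiler implements `atomic_exchange` with an XCHG instruction on these processors, although x86 compilers commonly do.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int held; } xchg_lock;  /* 0 = free, 1 = held */

void xchg_lock_init(xchg_lock *l) { atomic_init(&l->held, 0); }

/* Atomically write 1 and inspect the previous value.
   The lock was acquired only if the previous value was 0. */
bool xchg_lock_try(xchg_lock *l) {
    return atomic_exchange(&l->held, 1) == 0;
}

/* Spin until the exchange observes the lock as free. */
void xchg_lock_acquire(xchg_lock *l) {
    while (!xchg_lock_try(l)) {
        /* busy-wait; a production lock would back off or yield here */
    }
}

/* Release with a sequentially consistent store, so earlier writes
   are visible before the lock is observed as free. */
void xchg_lock_release(xchg_lock *l) {
    atomic_store(&l->held, 0);
}
```

A caller brackets its critical section with `xchg_lock_acquire` and `xchg_lock_release`. Because each acquisition attempt is a strongly ordered operation, the spin loop carries the performance cost noted above.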
In a single-processor system for memory regions defined as write-back cacheable, the general rule of thumb is: